text-to-image synthesis
fa64505ebdc94531087bc81251ce2376-Supplemental-Conference.pdf
In this work, we investigate the task of text-to-image (T2I) synthesis under the abstract-to-intricate setting, i.e., generating intricate visual content from simple abstract text prompts. Inspired by human imagination intuition, we propose a novel scene-graph hallucination (SGH) mechanism for effective abstract-to-intricate T2I synthesis. SGH carries out scene hallucination by expanding the initial scene graph (SG) of the input prompt with more feasible specific scene structures, in which the structured semantic representation of SG ensures high controllability of the intrinsic scene imagination. To approach the T2I synthesis, we deliberately build an SG-based hallucination diffusion system. First, we implement the SGH module based on the discrete diffusion technique, which evolves the SG structure by iteratively adding new scene elements. Then, we utilize another continuous-state diffusion model as the T2I synthesizer, where the overt image-generating process is navigated by the underlying semantic scene structure induced from the SGH module. On the benchmark COCO dataset, our system outperforms the existing best-performing T2I model by a significant margin, especially improving on the abstract-to-intricate T2I generation. Further in-depth analyses reveal how our methods advance.2
StyleDrop: Text-to-Image Synthesis of Any Style
Pre-trained large text-to-image models synthesize impressive images with an appropriate use of text prompts. However, ambiguities inherent in natural language, and out-of-distribution effects make it hard to synthesize arbitrary image styles, leveraging a specific design pattern, texture or material. In this paper, we introduce, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. StyleDrop is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. StyleDrop works by efficiently learning a new style by fine-tuning very few trainable parameters (less than 1\% of total model parameters), and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a image specifying the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. More results are available at our project website: https://styledrop.github.io .
Compositional Image Synthesis with Inference-Time Scaling
Ji, Minsuk, Lee, Sanghyeok, Ahn, Namhyuk
ABSTRACT Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where a object-centric vision-language model (VLM) judge re-ranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. Index T erms-- text-to-image synthesis, inference-time-scaling, object-centric 1. INTRODUCTION Text-to-image (T2I) diffusion models now deliver striking realism and diversity from textual prompts [1, 2, 3, 4], yet they still struggle with compositionality: the precise rendering of object counts, attributes, and spatial relations [5].
Implicit Inversion turns CLIP into a Decoder
D'Orazio, Antonio, Briglia, Maria Rosaria, Crisostomi, Donato, Loi, Dario, Rodolà, Emanuele, Masi, Iacopo
CLIP is a discriminative model trained to align images and text in a shared embedding space. Due to its multimodal structure, it serves as the backbone of many generative pipelines, where a decoder is trained to map from the shared space back to images. In this work, we show that image synthesis is nevertheless possible using CLIP alone -- without any decoder, training, or fine-tuning. Our approach optimizes a frequency-aware implicit neural representation that encourages coarse-to-fine generation by stratifying frequencies across network layers. To stabilize this inverse mapping, we introduce adversarially robust initialization, a lightweight Orthogonal Procrustes projection to align local text and image embeddings, and a blending loss that anchors outputs to natural image statistics. Without altering CLIP's weights, this framework unlocks capabilities such as text-to-image generation, style transfer, and image reconstruction. These findings suggest that discriminative models may hold untapped generative potential, hidden in plain sight.
Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
Although text-to-image (T2I) models exhibit remarkable generation capabilities,they frequently fail to accurately bind semantically related objects or attributesin the input prompts; a challenge termed semantic binding. Previous approacheseither involve intensive fine-tuning of the entire T2I model or require users orlarge language models to specify generation layouts, adding complexity. In thispaper, we define semantic binding as the task of associating a given object with itsattribute, termed attribute binding, or linking it to other related sub-objects, referredto as object binding. We introduce a novel method called Token Merging (ToMe),which enhances semantic binding by aggregating relevant tokens into a singlecomposite token. This ensures that the object, its attributes and sub-objects all sharethe same cross-attention map.